This is my first try to analyse a tHPI dataset. Therefore, some of the decisions and consequently their results seen in this report might not be perfect. Though, I’m continuously working on this report and will improve it iteratively.

Data Processing

I start by importing a recent tHPI which shows the average price in USD for a standard double room, based on 25 of the most popular US cities. I also add the Latitude and Longitude of these 25 cities, which will be used for fancier visualizations. Here is the tHPI for top 6 cities of US.

##           city January February March April May June July August September
## 1     New York     243      245   297   377 410  372  317    320       430
## 2  Los Angeles     197      196   203   219 219  227  252    245       220
## 3      Chicago     154      163   213   259 333  311  299    267       293
## 4       Dallas     148      143   154   158 150  148  145    147       157
## 5 Philadelphia     172      189   207   228 246  244  200    198       247
## 6      Houston     149      157   157   161 173  146  149    143       150
##   October      lat        lon
## 1     430 40.71278  -74.00594
## 2     219 34.05223 -118.24368
## 3     307 41.83690  -87.68470
## 4     163 32.77670  -96.79700
## 5     241 39.95000  -75.16670
## 6     154 29.76040  -95.36980

I reshape the spreadsheet data into a four column data frame, with city, month, price, and date in first, second, third and forth columns, respectively. The first six rows of the new data frame are as following.

##           city   month price       date
## 1     New York January   243 0015-01-01
## 2  Los Angeles January   197 0015-01-01
## 3      Chicago January   154 0015-01-01
## 4       Dallas January   148 0015-01-01
## 5 Philadelphia January   172 0015-01-01
## 6      Houston January   149 0015-01-01

A few minor data processing tasks will be done later during the EDA and modeling phases. Next, I’ll conduct Exploratory Data Analysis (EDA) for getting more insights into the dataset.

Exploratory Data Analysis (EDA)

The following three plots, show the density of hotel prices.

According to the first plot, most of the hotel prices are around $150, however this price can go up to around $450. Based on the second plot, in January, most of the cities have the $150 price while in e.g., May the prices are much higher. The third plot, decomposes the price distribution of different cities. Which is hard to analyse as it is. Therefore, I continue with the following two boxplots.

Actually, the above two plots became my favorites. The former one, illustrates the high-season. The average price starts increasing in April and is maximum in July. Then drops in August.

The latter plot, immediately shows different type of cities. Boston and New York have very high prices, and their prices vary very much. While, Houston or Orlando are cheaper and much more stable.

The following heatmap illustrates how prices change across months and cities.

The lower right corner of the above heatmap, roughly, captures the cities and months which I see the highest prices. In this heatmap, I can also differentiate between cities with stable prices and cities with very varying prices.

Here I try to make categories for cities. First, I categorize each city as “Cheap Cities”, “Moderate Cities” and “Expensive Cities”. Second, I categorize them as “Stable Cities” and “Unstable Cities”. Before this categorization, let me extract the mean, standard deviation and range of prices from January to October, for each city. Following comes these statistics for the top 6 cities of US.

##           city price.mean  price.sd price.range price.mean.class
## 1     New York      344.1 70.345891         187        (284,359]
## 2  Los Angeles      219.7 18.541545          56        (210,284]
## 3      Chicago      259.9 62.939918         179        (210,284]
## 4       Dallas      151.3  6.429965          20        (135,210]
## 5 Philadelphia      217.2 27.336585          75        (210,284]
## 6      Houston      153.9  8.685237          30        (135,210]
##   price.sd.class price.range.class
## 1    (44.5,82.9]         (126,235]
## 2    (6.12,44.5]        (17.8,126]
## 3    (44.5,82.9]         (126,235]
## 4    (6.12,44.5]        (17.8,126]
## 5    (6.12,44.5]        (17.8,126]
## 6    (6.12,44.5]        (17.8,126]

The change of prices in each category are shown in following plots.

Using all the above plots, my intuition is that fluctuation of prices are following certain patterns. I illustrate some patterns, namely up-down-up, up-up-down-up, up-up-up-up, up-up-up-down in the following four subplots.

Based on my observations in above plots, and considering that the traveling patterns follow the climate changes, school holidays (and other yearly and 6-monthly periodic events), I expect that a linear combination of some Sinus and Cosinus functions with periods of 12 months and 6 months can estimate the trend of prices.

Modeling

It seems that every city is following a periodic trend. For this, I define these four predictors: \(sin(\frac{\pi t}{6}),cos(\frac{\pi t}{6}),sin(\frac{\pi t}{3}),cos(\frac{\pi t}{3})\) where \(t\) is the number of month ranging from 1 to 12 for January to December. Then, I use a linear model to predict the price of each month. See following two R functions.

get.predictors <- function(t) {
  return(data.frame(t=t,
           sin = sin(pi/6*t), # 12 month period
           cos = cos(pi/6*t),
           sin2 = sin(pi/3*t), # 6 month period
           cos2 = cos(pi/3*t)
           )
         )
}
get.predictions <- function(price.trend,time.end) {
  lmfit <- lm(y ~ sin+cos+sin2+cos2,data=price.trend)
  prediction.time <- 1:time.end
  return(price.prediction <- data.frame(t=prediction.time,y=predict(lmfit,get.predictors(prediction.time))))
}

Estimating October Prices (Testing)

To test this model I train one model for each city, based on the prices from January to September, and then compare its predictions for October prices with the real values.

After visually testing the predicted prices of October, I realized that for some of the cities the predictions are very promissing, while for some other cities predictions fail.

Some good predictions of October price are shown in following plots (the black dots are real values and the red dots and lines are my predictions).

For some other cities, the predictions don’t work that well. This can be because of unknown predictors that our model doesn’t take into account, or other reasons!

Overall, the predictions are not very bad; the trend of prices is usually followed by the model. Therefore, next, I use my model to predict the prices of November and December.

Estimating November and December Prices (To be Verified)

I use all available prices from January to October to train my model and then I make predictions for November and December prices. The predictions for all 25 US cities are illustrated in following (The black dots and lines show the real prices and red lines show the predictions).

Following comes my predictions for prices in next months. To verify these predictions, we can wait two more months and compare the results.

##               city November December
## 1         New York      418      327
## 2      Los Angeles      201      195
## 3          Chicago      279      219
## 4           Dallas      161      153
## 5     Philadelphia      238      206
## 6          Houston      156      153
## 7  Washington D.C.      321      263
## 8            Miami      193      219
## 9          Atlanta      179      172
## 10          Boston      408      314
## 11   San Francisco      307      260
## 12         Detroit      169      167
## 13         Orlando      130      133
## 14       San Diego      149      144
## 15       Las Vegas      185      170
## 16         Phoenix      132      140
## 17    Indianapolis      127      117
## 18  Salt Lake City      141      134
## 19          Denver      197      186
## 20     New Orleans      219      209
## 21       St. Louis      153      137
## 22         Seattle      185      172
## 23     Minneapolis      232      198
## 24     San Antonio      107      108
## 25        San Jose      192      177

Clustering based on the Periodic Model

While training the models, for each city I get 4 model coefficients. I assume that these coefficients are the features for the cities. This allows me to use a K-means clustering algorithm to see if the coefficients can help me to make clusters for the cities.

A simple model with k=4 clusters have a goodness of 74.4%. The following figure illustrates the 25 cities on a map and each cluster is shown by a color.

The above map already captures some of the geographical and climate characteristics of cities. For instance, Seattle which indeed has a different climate from other cities is the single member of its own cluster. Besides, many cities in central and western parts of the US which have similar climates end up in the same cluster. These clusters should be studied further in the future.

Summary and Conclusion

In this technical report, I started by importing the tHPI data and applied some basic modifications and reshaping to data. Then I conducted EDA to get insights into this dataset. As a result of my EDA I realized that prices follow periodic patterns. Therefore, I decided to use \(Sinus\) and \(Cosinus\) functions to fit a model to data. First, I tested my model by leaving out the October prices from training data and then I visually compared my predictions with the real prices. Then, I trained my model based on all available prices and made predictions for November and December prices. Also, coefficients of my model helped me to summarize each city with four numeric features which I used to cluster the cities.

I emphasize that the analysis provided in this report is very basic, and the choices might be naive in some cases. To improve the results, more domain knowledge, more data and deeper analysis are required. Next section elaborates on some of the possible future improvements.

Future Perspective

If I find more time, I’ll do the following.

  1. Using other tHPI Datasets
  1. Collecting new Data
  1. Feedback loop - We should be careful with the effect of our predictions on future prices. The prices are determined in a negotiating process; by making predictions of future prices we might influence this negotiation process and this can result in emergence of new patterns.
  2. Pricing Anomalies - If I look at the monthly estimations and real world trends, I see some anomalies! For example in Houston the prices in May and June show a significant deviation from the expected periodic pattern. Also, in Atlanta prices of July and August deviate. July of San Diago and San Antonio should be explored more, too.
  3. The Periodic model I trained in this report was tested by leaving out the prices of October and then comparing estimations with real values for this month. However, I didn’t provide a concrete accuracy measure for this model. This can be further explored.